Setup notebook

The following libraries are used in for in this notebook:

# Load libraries
library(tidyverse)
library(tidymodels)
library(readr)
library(glmnet)
package 㤼㸱glmnet㤼㸲 was built under R version 4.0.3
library(leaps)
package 㤼㸱leaps㤼㸲 was built under R version 4.0.3
library(naniar)
package 㤼㸱naniar㤼㸲 was built under R version 4.0.3
library(skimr)
package 㤼㸱skimr㤼㸲 was built under R version 4.0.3
library(knitr)
library(corrplot)
package 㤼㸱corrplot㤼㸲 was built under R version 4.0.3

Load data listings

# Read csv with listing information
data <- read_csv(gzfile("listings.csv.gz"))
38 parsing failures.
  row     col           expected         actual         file
15026 license 1/0/T/F/TRUE/FALSE NL857416819B01 <connection>
15673 license 1/0/T/F/TRUE/FALSE NL825517485B01 <connection>
15961 license 1/0/T/F/TRUE/FALSE NL825517485B01 <connection>
16391 license 1/0/T/F/TRUE/FALSE 855596338B01   <connection>
16823 license 1/0/T/F/TRUE/FALSE NL854649426B01 <connection>
..... ....... .................. .............. ............
See problems(...) for more details.

Select variables of interest

The goal of the assignment is to build a model that predicts the prices of listings on AirBnB in Amsterdam. The outcomes of the model will be used for suggestions to the new hosts about the average platform price for similar listings. Then hosts can choose whether they want to use the recommendation to set their prices accordingly in order to be competitive and gain attention from the guests since the beginning. All variables including information on the reviews give information about a listing after it has been published. Therefore, this variables are not included in the data set. Moreover, the variables including a description and summary about the listing can be analyzed using NLP (e.g. sentiment analysis). However, this is beyond the scope of the assignment. Therefore, these variables are excluded from the model. The variables below are included in the data set for further analysis and cleaning.

# Generate subset with variables of interest
data_sub <- data %>%
  select(id, price, property_type, room_type, accommodates, bathrooms, bedrooms,
         beds, bed_type, amenities, host_since, host_response_time,
         host_response_rate, host_neighbourhood, host_listings_count, 
         host_verifications, host_identity_verified, neighbourhood_cleansed,
         square_feet, cleaning_fee, guests_included, extra_people, 
         minimum_nights, maximum_nights, availability_30, availability_60,
         availability_90, availability_365, instant_bookable, 
         cancellation_policy, require_guest_profile_picture,
         require_guest_phone_verification, calculated_host_listings_count,
         calculated_host_listings_count_entire_homes, 
         calculated_host_listings_count_private_rooms,
         calculated_host_listings_count_shared_rooms)
# Inspect data
head(data_sub)
<<<<<<< HEAD ======= >>>>>>> fc0442fd76290b6483426e318740d6bc774d6ae0
# Inspect data
skim(data_sub) %>% knit_print()
Name data_sub
Number of rows 20025
Number of columns 36
_______________________
Column type frequency:
character 13
Date 1
logical 4
numeric 18
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
price 0 1.00 5 9 0 479 0
property_type 0 1.00 3 22 0 34 0
room_type 0 1.00 10 15 0 4 0
bed_type 0 1.00 5 13 0 5 0
amenities 0 1.00 2 1303 0 19213 0
host_response_time 158 0.99 3 18 0 5 0
host_response_rate 158 0.99 2 4 0 61 0
host_neighbourhood 5972 0.70 4 35 0 81 0
host_verifications 0 1.00 2 158 0 381 0
neighbourhood_cleansed 0 1.00 4 38 0 22 0
cleaning_fee 3604 0.82 5 7 0 112 0
extra_people 0 1.00 5 7 0 112 0
cancellation_policy 0 1.00 8 27 0 5 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
host_since 158 0.99 2008-09-24 2019-12-06 2015-02-08 3133

Variable type: logical

skim_variable n_missing complete_rate mean count
host_identity_verified 158 0.99 0.39 FAL: 12094, TRU: 7773
instant_bookable 0 1.00 0.26 FAL: 14839, TRU: 5186
require_guest_profile_picture 0 1.00 0.01 FAL: 19819, TRU: 206
require_guest_phone_verification 0 1.00 0.01 FAL: 19758, TRU: 267

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 19117026.00 11473019.80 2818 9631819 18418621 27913527.0 40655209 ▇▇▇▅▆
accommodates 0 1.00 2.87 1.30 1 2 2 4.0 18 ▇▁▁▁▁
bathrooms 6 1.00 1.18 0.39 0 1 1 1.5 8 ▇▁▁▁▁
bedrooms 13 1.00 1.45 0.89 0 1 1 2.0 12 ▇▁▁▁▁
beds 31 1.00 1.79 1.41 0 1 1 2.0 32 ▇▁▁▁▁
host_listings_count 158 0.99 3.85 28.93 0 1 1 1.0 751 ▇▁▁▁▁
square_feet 19662 0.02 542.88 560.24 0 0 484 834.0 3229 ▇▅▁▁▁
guests_included 0 1.00 1.46 0.95 1 1 1 2.0 16 ▇▁▁▁▁
minimum_nights 0 1.00 3.43 14.74 1 2 2 3.0 1001 ▇▁▁▁▁
maximum_nights 0 1.00 613.89 548.62 1 20 1125 1125.0 11250 ▇▁▁▁▁
availability_30 0 1.00 4.46 7.81 0 0 0 6.0 30 ▇▁▁▁▁
availability_60 0 1.00 10.05 17.14 0 0 0 13.0 60 ▇▁▁▁▁
availability_90 0 1.00 15.69 26.77 0 0 0 19.0 90 ▇▁▁▁▁
availability_365 0 1.00 47.93 95.34 0 0 0 37.0 365 ▇▁▁▁▁
calculated_host_listings_count 0 1.00 1.97 5.20 1 1 1 1.0 72 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 1.50 5.06 0 1 1 1.0 72 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 0.38 1.08 0 0 0 0.0 16 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 0.01 0.09 0 0 0 0.0 3 ▇▁▁▁▁

Data cleaning

Basic cleaning

First, we converted all categorical en logical variables, that did not need any further cleaning, to type factors.

# Convert columns to factors 
data_sub$property_type <- factor(data_sub$property_type , 
                                 levels = unique(data_sub$property_type))
data_sub$room_type <- factor(data_sub$room_type , 
                             levels = unique(data_sub$room_type))
data_sub$bed_type <- factor(data_sub$bed_type , 
                    levels = unique(data_sub$bed_type))
data_sub$host_response_time <- factor(data_sub$ host_response_time, 
                    levels = unique(data_sub$host_response_time))
data_sub$host_neighbourhood <- factor(data_sub$host_neighbourhood, 
                    levels = unique(data_sub$host_neighbourhood))
data_sub$neighbourhood_cleansed <- factor(data_sub$neighbourhood_cleansed, 
                    levels = unique(data_sub$neighbourhood_cleansed))
data_sub$cancellation_policy <- factor(data_sub$cancellation_policy , 
                    levels = unique(data_sub$cancellation_policy))

# Convert logical variables to factors
data_sub$host_identity_verified <- factor(data_sub$host_identity_verified)
data_sub$instant_bookable <- factor(data_sub$instant_bookable)
data_sub$require_guest_profile_picture <- 
  factor(data_sub$require_guest_profile_picture)
data_sub$require_guest_phone_verification <- 
  factor(data_sub$require_guest_phone_verification)

Second, some of the variables contain numeric variables, however, they are stored in a string containing a dollar or percentage sign. The signs we removed from the strings and the remaining number converted to numeric variables.

# Remove $ sign from columns containing prices and convert to doubles
data_sub$price <- as.double(gsub("[,$]", "", data_sub$price))
data_sub$cleaning_fee <- as.double(gsub("[,$]", "", data_sub$cleaning_fee))
data_sub$extra_people <- as.double(gsub("[,$]", "", data_sub$extra_people))

# Replace "N/A" values, remove % and convert to percentage 
data_sub$host_response_rate <- na_if(data_sub$host_response_rate, "N/A")
data_sub$host_response_rate <- 
  as.double(gsub("[%]", "", data_sub$host_response_rate)) / 100

Clean amenities

The variable amenities contains all the amenities of the listing. However, this was stored in one large string and the elements could not be access separately. Therefore, we cleaned the string, spitted it so we had a list containing the separated elements. Furthermore, we extracted all unique amenities and stored these in a vector.

# Clean and split strings for amenities
# Returns a list with all unique values
clean_amenities <- function(x) {
  subbed <- gsub('[{}\"]', "", tolower(x))
  splitted <- str_split(subbed, ",")
  clean <- sapply(splitted, function(x) str_trim(x, side = "both"))
  return(clean)
}

# Clean amenities
data_sub$amenities_clean <- 
  sapply(data_sub$amenities, function(x) clean_amenities(x))

# Create vector with all unique amenities
amenities_unique = c()
for(amenities in data_sub$amenities_clean) {
  for(element in amenities) {
    if(!(element %in% amenities_unique) & element != "") {
      amenities_unique <- append(amenities_unique, str_trim(element))
    }
  }
}

# Show result
amenities_unique
<<<<<<< HEAD
  [1] "internet"                                   "wifi"                                       "paid parking off premises"                 
  [4] "buzzer/wireless intercom"                   "heating"                                    "washer"                                    
  [7] "smoke detector"                             "carbon monoxide detector"                   "first aid kit"                             
 [10] "safety card"                                "fire extinguisher"                          "essentials"                                
 [13] "shampoo"                                    "lock on bedroom door"                       "24-hour check-in"                          
 [16] "hangers"                                    "hair dryer"                                 "iron"                                      
 [19] "laptop friendly workspace"                  "translation missing: en.hosting_amenity_49" "translation missing: en.hosting_amenity_50"
 [22] "private entrance"                           "hot water"                                  "bed linens"                                
 [25] "extra pillows and blankets"                 "single level home"                          "garden or backyard"                        
 [28] "no stairs or steps to enter"                "accessible-height bed"                      "host greets you"                           
 [31] "handheld shower head"                       "paid parking on premises"                   "tv"                                        
 [34] "refrigerator"                               "long term stays allowed"                    "cable tv"                                  
 [37] "kitchen"                                    "elevator"                                   "indoor fireplace"                          
 [40] "family/kid friendly"                        "dryer"                                      "private living room"                       
 [43] "well-lit path to entrance"                  "breakfast"                                  "self check-in"                             
 [46] "smart lock"                                 "lake access"                                "pets live on this property"                
 [49] "cat(s)"                                     "smoking allowed"                            "pets allowed"                              
 [52] "microwave"                                  "coffee maker"                               "dishwasher"                                
 [55] "dishes and silverware"                      "cooking basics"                             "oven"                                      
 [58] "stove"                                      "patio or balcony"                           "keypad"                                    
 [61] "luggage dropoff allowed"                    "baby bath"                                  "bathtub"                                   
 [64] "babysitter recommendations"                 "beach essentials"                           "cleaning before checkout"                  
 [67] "ev charger"                                 "pack ’n play/travel crib"                   "high chair"                                
 [70] "crib"                                       "children’s books and toys"                  "room-darkening shades"                     
 [73] "children’s dinnerware"                      "free street parking"                        "other"                                     
 [76] "extra space around bed"                     "wheelchair accessible"                      "pocket wifi"                               
 [79] "wide hallways"                              "waterfront"                                 "washer / dryer"                            
 [82] "window guards"                              "air conditioning"                           "suitable for events"                       
 [85] "free parking on premises"                   "lockbox"                                    "wide entrance for guests"                  
 [88] "bbq grill"                                  "ground floor access"                        "dog(s)"                                    
 [91] "hot tub"                                    "wide entrance"                              "accessible-height toilet"                  
 [94] "wide entryway"                              "hot water kettle"                           "ethernet connection"                       
 [97] "flat path to guest entrance"                "wide doorway to guest bathroom"             "wide clearance to shower"                  
[100] "toilet"                                     "step-free shower"                           "gym"                                       
[103] "outlet covers"                              "firm mattress"                              "changing table"                            
[106] "stair gates"                                "fireplace guards"                           "table corner guards"                       
[109] "game console"                               "baby monitor"                               "doorman"                                   
[112] "building staff"                             "pool"                                       "other pet(s)"                              
[115] "fixed grab bars for shower"                 "disabled parking spot"                      "electric profiling bed"                    
[118] "bathtub with bath chair"                    "shower gel"                                 "fixed grab bars for toilet"                
[121] "beachfront"                                 "shower chair"                               "trash can"                                 
[124] "ski-in/ski-out"                             "mobile hoist"                               "pool with pool hoist"                      
[127] "ceiling hoist"                              "full kitchen"                               "private bathroom"                          
[130] "air purifier"                               "bread maker"                                "roll-in shower with chair"                 
=======
  [1] "internet"                                   "wifi"                                       "paid parking off premises"                  "buzzer/wireless intercom"                  
  [5] "heating"                                    "washer"                                     "smoke detector"                             "carbon monoxide detector"                  
  [9] "first aid kit"                              "safety card"                                "fire extinguisher"                          "essentials"                                
 [13] "shampoo"                                    "lock on bedroom door"                       "24-hour check-in"                           "hangers"                                   
 [17] "hair dryer"                                 "iron"                                       "laptop friendly workspace"                  "translation missing: en.hosting_amenity_49"
 [21] "translation missing: en.hosting_amenity_50" "private entrance"                           "hot water"                                  "bed linens"                                
 [25] "extra pillows and blankets"                 "single level home"                          "garden or backyard"                         "no stairs or steps to enter"               
 [29] "accessible-height bed"                      "host greets you"                            "handheld shower head"                       "paid parking on premises"                  
 [33] "tv"                                         "refrigerator"                               "long term stays allowed"                    "cable tv"                                  
 [37] "kitchen"                                    "elevator"                                   "indoor fireplace"                           "family/kid friendly"                       
 [41] "dryer"                                      "private living room"                        "well-lit path to entrance"                  "breakfast"                                 
 [45] "self check-in"                              "smart lock"                                 "lake access"                                "pets live on this property"                
 [49] "cat(s)"                                     "smoking allowed"                            "pets allowed"                               "microwave"                                 
 [53] "coffee maker"                               "dishwasher"                                 "dishes and silverware"                      "cooking basics"                            
 [57] "oven"                                       "stove"                                      "patio or balcony"                           "keypad"                                    
 [61] "luggage dropoff allowed"                    "baby bath"                                  "bathtub"                                    "babysitter recommendations"                
 [65] "beach essentials"                           "cleaning before checkout"                   "ev charger"                                 "pack ’n play/travel crib"                  
 [69] "high chair"                                 "crib"                                       "children’s books and toys"                  "room-darkening shades"                     
 [73] "children’s dinnerware"                      "free street parking"                        "other"                                      "extra space around bed"                    
 [77] "wheelchair accessible"                      "pocket wifi"                                "wide hallways"                              "waterfront"                                
 [81] "washer / dryer"                             "window guards"                              "air conditioning"                           "suitable for events"                       
 [85] "free parking on premises"                   "lockbox"                                    "wide entrance for guests"                   "bbq grill"                                 
 [89] "ground floor access"                        "dog(s)"                                     "hot tub"                                    "wide entrance"                             
 [93] "accessible-height toilet"                   "wide entryway"                              "hot water kettle"                           "ethernet connection"                       
 [97] "flat path to guest entrance"                "wide doorway to guest bathroom"             "wide clearance to shower"                   "toilet"                                    
[101] "step-free shower"                           "gym"                                        "outlet covers"                              "firm mattress"                             
[105] "changing table"                             "stair gates"                                "fireplace guards"                           "table corner guards"                       
[109] "game console"                               "baby monitor"                               "doorman"                                    "building staff"                            
[113] "pool"                                       "other pet(s)"                               "fixed grab bars for shower"                 "disabled parking spot"                     
[117] "electric profiling bed"                     "bathtub with bath chair"                    "shower gel"                                 "fixed grab bars for toilet"                
[121] "beachfront"                                 "shower chair"                               "trash can"                                  "ski-in/ski-out"                            
[125] "mobile hoist"                               "pool with pool hoist"                       "ceiling hoist"                              "full kitchen"                              
[129] "private bathroom"                           "air purifier"                               "bread maker"                                "roll-in shower with chair"                 
>>>>>>> fc0442fd76290b6483426e318740d6bc774d6ae0

Create new variables based information stored in amenities

With the vector of all unique amenities we could create variables for all the separated variables. However, since there are 130 usable amenities it seemed beyond the scope of the assignment. Moreover, some of the amenities contain information that is also given by other variables or about other amenities. For example a private bathroom could also be a strong indicator that the listing is an entire house/apartment. Also, shampoo or shower gel could be strong indicators for the presence of a bathroom. Furthermore, some are amenities are very specific and apply only to a few or one house. However, we created variables some of the amenities we think could have an influence on the price of a listing. We created variables for wifi, pool, hot_tub and tv.

# Create variable for WIFI and add to data set
wifi <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("wifi" %in% data_sub$amenities_clean[[i]] | 
     "internet" %in% data_sub$amenities_clean[[i]]) {
    wifi[i] <-  "yes"
  } else {
    wifi[i] <-  "no"
  }
}
data_sub$wifi <- wifi
data_sub$wifi <- factor(data_sub$wifi, levels = c("yes", "no"))

# Create variable for pool and add to data set
pool <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("pool" %in% data_sub$amenities_clean[[i]] |
     "pool with pool hoist" %in% data_sub$amenities_clean[[i]]) {
    pool[i] <-  "yes"
  } else {
    pool[i] <-  "no"
  }
}
data_sub$pool <- pool
data_sub$pool <- factor(data_sub$pool, levels = c("yes", "no"))

# Create variable for hot_tub and add to data set
hot_tub <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("hot tub" %in% data_sub$amenities_clean[[i]]) {
    hot_tub[i] <-  "yes"
  } else {
    hot_tub[i] <-  "no"
  }
}
data_sub$hot_tub <- hot_tub
data_sub$hot_tub <- factor(data_sub$hot_tub, levels = c("yes", "no"))

# Create variable for hot_tub and add to data set
tv <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("tv" %in% data_sub$amenities_clean[[i]] |
     "cable tv" %in% data_sub$amenities_clean[[i]]) {
    tv[i] <-  "yes"
  } else {
    tv[i] <-  "no"
  }
}
data_sub$tv <- tv
data_sub$tv <- factor(data_sub$tv, levels = c("yes", "no"))

Clean host verification methods

For the variable host_verification the same applies as to amenities. We have applied the same cleaning method, first we cleaned and splitted the strings, whereafther we generated a vector with all unique amenities.

# Clean and split strings for host verification methods
# Returns a list with all unique values
clean_verficiations <- function(x) {
  subbed <- gsub("\\[|\\]", "", tolower(x))
  subbed_complete <- gsub("[']", "", subbed)
  splitted <- str_split(subbed_complete, ",")
  clean <- sapply(splitted, function(x) str_trim(x, side = "both"))
  return(clean)
}

# Clean host_verifications
data_sub$host_verifications_clean <- 
  sapply(data_sub$host_verifications, function(x) clean_verficiations(x))

# Generate list with all unique host verification methods
verifications_unique = c()
for(verifications in data_sub$host_verifications_clean) {
  for(element in verifications) {
    if(!(element %in% verifications_unique) & element != "") {
      verifications_unique <- append(verifications_unique, str_trim(element))
    }
  }
}

# Show result
verifications_unique
<<<<<<< HEAD
 [1] "email"                 "phone"                 "reviews"               "jumio"                 "offline_government_id"
 [6] "selfie"                "government_id"         "identity_manual"       "facebook"              "work_email"           
[11] "none"                  "google"                "manual_offline"        "manual_online"         "sent_id"              
[16] "kba"                   "weibo"                 "zhima_selfie"          "sesame"                "sesame_offline"       
=======
 [1] "email"                 "phone"                 "reviews"               "jumio"                 "offline_government_id" "selfie"                "government_id"         "identity_manual"       "facebook"             
[10] "work_email"            "none"                  "google"                "manual_offline"        "manual_online"         "sent_id"               "kba"                   "weibo"                 "zhima_selfie"         
[19] "sesame"                "sesame_offline"       
>>>>>>> fc0442fd76290b6483426e318740d6bc774d6ae0

Create variables based on information stored in host verifications

We created separate variables for the most common methods of verification, namely email, phone, facebook and government_id.

# Create variable for host email and add to data set
host_email <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("email" %in% data_sub$host_verifications_clean[[i]]) {
    host_email[i] <-  "yes"
  } else {
    host_email[i] <-  "no"
  }
}
data_sub$host_email <- host_email 
data_sub$host_email <- factor(data_sub$host_email, levels = c("yes", "no"))

# Create variable for phone and add to data set 
host_phone <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("phone" %in% data_sub$host_verifications_clean[[i]]) {
    host_phone[i] <-  "yes"
  } else {
    host_phone[i] <-  "no"
  }
}
data_sub$host_phone <- host_phone 
data_sub$host_phone <- factor(data_sub$host_phone, levels = c("yes", "no"))
 
# Create variable for host facebook and add to data set
host_facebook <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("facebook" %in% data_sub$host_verifications_clean[[i]]) {
    host_facebook[i] <-  "yes"
  } else {
    host_facebook[i] <-  "no"
  }
}
data_sub$host_facebook <- host_facebook 
data_sub$host_facebook <- 
  factor(data_sub$host_facebook, levels = c("yes", "no"))

# Create variable for government id 
host_government_id <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("government_id" %in% data_sub$host_verifications_clean[[i]]) {
    host_government_id[i] <-  "yes"
  } else {
    host_government_id[i] <-  "no"
  }
}
data_sub$host_government_id <- host_government_id 
data_sub$host_government_id <- 
  factor(data_sub$host_government_id, levels = c("yes", "no"))

Clean date variables

We used the variable host_since to create a new variable host_years_active, which contains information on the number of years a host has been active on the platform.

# Create new variable for active years host 
data_sub <- data_sub %>% 
  mutate(host_years_active = 
           as.double(as.Date("2019-12-07") - host_since) / 365)

Inspect availability variables

The variables availability_30, availability_60, availability_90 and availability_365 carry some of the same information. In order to inspect if we should include all variables in the data set, we plotted a correlation matrix. The plot below shows that all variables indicating the availability are strongly correlated. Therefore, we only include the variable availability_30 for further analysis.

# Plot correlation matrix
data_sub %>% select_if(is.numeric) %>% 
  select(availability_30, availability_60, availability_90, availability_365) %>% 
  cor() %>% corrplot()


# Remove other availability variables from data set
data_sub <- 
  data_sub %>% 
  select(-availability_60, -availability_90, -availability_365)

Check missing data

We have create a table to inspect with variables have missing cases and how many. The table shows the variables in the data set that contain missing values (in descending order). The table shows that the variable square_feet has \(19662\) missing cases, which is about \(98.19\%\). If we would deleted the missing cases, the data set will barely contain any data. Moreover, other methods for handling missing values like replacing NA-values with the mean or median would not be appropriate since the variables will be based on only \(1.81\%\) of the data. Therefore, square_feet is nog included in the final. The variable host_response_rate has \(9349\) missing cases, which is about \(46.69\%\). The variable host_neighbourhood has \(5972\), which is about \(29.82\%\). The variable cleaning_fee has \(3604\), which is about \(18.00\%\). The variables host_response_rate, host_neighbourhood and cleaning_fee do not have as many missing values as square_feet, however, the same reasoning applies. As a result, these variables are also excluded from the analyses.

# Count missing cases per variable
na_counter <- sapply(data_sub, function(x) sum(is.na(x)))
vars <- colnames(data_sub)

# Extract all variables with NA-values
na_values <- tibble(variables = vars, na_count = na_counter) 

# Check na count per variable
na_values %>%
  filter(na_count > 0) %>%
  arrange(desc(na_count))

Moreover, we have check how many cases contain missing values if the other values that have missing cases would be included in the ‘final’ data set. The table below shows that there are 206 cases which contain missing values, which is about \(1.03\%\) of the entire data set. Since this is a very small proportion it is not very likely that deleting these causes would have a large impact on the predictions. Therefore, we delete the cases with missing values.

# Compute incomplete rows
data_sub %>% 
  select(host_since, host_response_time, host_listings_count,
         host_identity_verified, beds, bedrooms, host_years_active, 
         bathrooms) %>% 
  complete.cases() %>% 
  summary(count())
   Mode   FALSE    TRUE 
logical     206   19819 
# Select variables for data set
variables_analysis <-   
  na_values %>% 
  filter(na_count <= 158) %>% 
  select(variables) %>% 
  pull(variables)

# Create final data set
data_semi_final <- data_sub %>% select(all_of(variables_analysis))
data_semi_final <- data_semi_final %>% 
  select(-c(amenities, amenities_clean, host_verifications,
            host_verifications_clean, host_since))
data_final <- data_semi_final[complete.cases(data_semi_final), ]

Create train-test split

For further analysis the data is split into a train-test set.

# Create a train-split sets
seed_x <-  123
set.seed(seed_x)
data_split <- initial_split(data_final, prop = 0.7)
data_train <- training(data_split)
data_test <- testing(data_split)

Inspecting the predicted variable

In order to prevent data leakages we only inspect the predicted variable price in the training set. In order to inspect price we have created a distribution plot. The plot shows that the data set contains some outliers and that the distribution is rightly skewed. Both the outliers and the skeweness make the data less interpretable and this could have an influence on performance of the models.

# Plot distribution price
ggplot(data = data_train , aes(price)) +
  geom_histogram(col="black",
                 breaks=seq(0, max(data_train$price), by=75),
                 aes(fill=..count..)) +
  labs(title="Histogram for Price", x="Price", y="Count") +
  scale_fill_gradient("Count", low="green", high="red")

In order to prevent this potential problems we have created a new distribution plot with a log transformed variable price. The plot shows that the distribution is less skewed and does not contain any large outliers. Resulting, the data that is more interpretable. Therefore, log transforming we will use the log transformed price for our models.

# Plot distribution price
ggplot(data = data_train , aes(log(price + 1))) +
  geom_histogram(col="black",
                 aes(fill=..count..)) +
  labs(title="Histogram for Ln Price", x="Ln Price", y="Count") +
  scale_fill_gradient("Count", low="green", high="red")

# Log transform the price in for both training and test set
data_final$price <- log(data_final$price + 1)

# Resplit the data using the same seed 
set.seed(seed_x)
data_split <- initial_split(data_final, prop = 0.7)
data_train <- training(data_split)
data_test <- testing(data_split)

K-fold cross validation

Moreover, we have generated 10-fold cross validation sets.

# Generate 10-fold CV sets
set.seed(321)
data_folds <- vfold_cv(data_train, v = 10)
data_folds
#  10-fold cross-validation 
<<<<<<< HEAD ======= >>>>>>> fc0442fd76290b6483426e318740d6bc774d6ae0

Remove data frames to avoid leakages and errors

rm(data)
rm(data_sub)
rm(data_semi_final)
rm(na_values)
---
title: "R Notebook"
output: html_notebook
---

# Setup notebook

The following libraries are used in for in this notebook:
```{r message = FALSE}
# Load libraries
library(tidyverse)
library(tidymodels)
library(readr)
library(glmnet)
library(leaps)
library(naniar)
library(skimr)
library(knitr)
library(corrplot)
```

# Load data listings
```{r message = FALSE}
# Read csv with listing information
data <- read_csv(gzfile("listings.csv.gz"))
```

# Select variables of interest
The goal of the assignment is to build a model that predicts the prices of listings on AirBnB in Amsterdam. The outcomes of the model will be used for suggestions to the new hosts about the average platform price for similar listings. Then hosts can choose whether they want to use the recommendation to set their prices accordingly in order to be competitive and gain attention from the guests since the beginning. All variables including information on the reviews give information about a listing after it has been published. Therefore, this variables are not included in the data set. Moreover, the variables including a description and summary about the listing can be analyzed using NLP (e.g. sentiment analysis). However, this is beyond the scope of the assignment. Therefore, these variables are excluded from the model. The variables below are included in the data set for further analysis and cleaning.

```{r message = FALSE}
# Generate subset with variables of interest
data_sub <- data %>%
  select(id, price, property_type, room_type, accommodates, bathrooms, bedrooms,
         beds, bed_type, amenities, host_since, host_response_time,
         host_response_rate, host_neighbourhood, host_listings_count, 
         host_verifications, host_identity_verified, neighbourhood_cleansed,
         square_feet, cleaning_fee, guests_included, extra_people, 
         minimum_nights, maximum_nights, availability_30, availability_60,
         availability_90, availability_365, instant_bookable, 
         cancellation_policy, require_guest_profile_picture,
         require_guest_phone_verification, calculated_host_listings_count,
         calculated_host_listings_count_entire_homes, 
         calculated_host_listings_count_private_rooms,
         calculated_host_listings_count_shared_rooms)
```

```{r}
# Inspect data
head(data_sub)
```

```{r}
# Inspect data
skim(data_sub) %>% knit_print()
```

# Data cleaning

## Basic cleaning

First, we converted all categorical en logical variables, that did not need any further cleaning, to type factors.
```{r Converting vectors}
# Convert columns to factors 
data_sub$property_type <- factor(data_sub$property_type , 
                                 levels = unique(data_sub$property_type))
data_sub$room_type <- factor(data_sub$room_type , 
                             levels = unique(data_sub$room_type))
data_sub$bed_type <- factor(data_sub$bed_type , 
                    levels = unique(data_sub$bed_type))
data_sub$host_response_time <- factor(data_sub$ host_response_time, 
                    levels = unique(data_sub$host_response_time))
data_sub$host_neighbourhood <- factor(data_sub$host_neighbourhood, 
                    levels = unique(data_sub$host_neighbourhood))
data_sub$neighbourhood_cleansed <- factor(data_sub$neighbourhood_cleansed, 
                    levels = unique(data_sub$neighbourhood_cleansed))
data_sub$cancellation_policy <- factor(data_sub$cancellation_policy , 
                    levels = unique(data_sub$cancellation_policy))

# Convert logical variables to factors
data_sub$host_identity_verified <- factor(data_sub$host_identity_verified)
data_sub$instant_bookable <- factor(data_sub$instant_bookable)
data_sub$require_guest_profile_picture <- 
  factor(data_sub$require_guest_profile_picture)
data_sub$require_guest_phone_verification <- 
  factor(data_sub$require_guest_phone_verification)
```

Second, some of the variables contain numeric variables, however, they are stored in a string containing a dollar or percentage sign. The signs we removed from the strings and the remaining number converted to numeric variables.
```{r Convert }
# Remove $ sign from columns containing prices and convert to doubles
data_sub$price <- as.double(gsub("[,$]", "", data_sub$price))
data_sub$cleaning_fee <- as.double(gsub("[,$]", "", data_sub$cleaning_fee))
data_sub$extra_people <- as.double(gsub("[,$]", "", data_sub$extra_people))

# Replace "N/A" values, remove % and convert to percentage 
data_sub$host_response_rate <- na_if(data_sub$host_response_rate, "N/A")
data_sub$host_response_rate <- 
  as.double(gsub("[%]", "", data_sub$host_response_rate)) / 100
```

## Clean amenities 

The variable *amenities* contains all the amenities of the listing. However, this was stored in one large string and the elements could not be access separately. Therefore, we cleaned the string, spitted it so we had a list containing the separated elements. Furthermore, we extracted all unique amenities and stored these in a vector. 
```{r Generate all unique amenities}
# Clean and split strings for amenities
# Returns a list with all unique values
clean_amenities <- function(x) {
  subbed <- gsub('[{}\"]', "", tolower(x))
  splitted <- str_split(subbed, ",")
  clean <- sapply(splitted, function(x) str_trim(x, side = "both"))
  return(clean)
}

# Clean amenities
data_sub$amenities_clean <- 
  sapply(data_sub$amenities, function(x) clean_amenities(x))

# Create vector with all unique amenities
amenities_unique = c()
for(amenities in data_sub$amenities_clean) {
  for(element in amenities) {
    if(!(element %in% amenities_unique) & element != "") {
      amenities_unique <- append(amenities_unique, str_trim(element))
    }
  }
}

# Show result
amenities_unique
```

### Create new variables based information stored in amenities

With the vector of all unique amenities we could create variables for all the separated variables. However, since there are 130 usable amenities it seemed beyond the scope of the assignment. Moreover, some of the amenities contain information that is also given by other variables or about other amenities. For example a private bathroom could also be a strong indicator that the listing is an entire house/apartment. Also, shampoo or shower gel could be strong indicators for the presence of a bathroom. Furthermore, some are amenities are very specific and apply only to a few or one house. However, we created variables some of the amenities we think could have an influence on the price of a listing. We created variables for *wifi*, *pool*, *hot_tub* and *tv*. 
```{r Create new variables for amenities}
# Create variable for WIFI and add to data set
wifi <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("wifi" %in% data_sub$amenities_clean[[i]] | 
     "internet" %in% data_sub$amenities_clean[[i]]) {
    wifi[i] <-  "yes"
  } else {
    wifi[i] <-  "no"
  }
}
data_sub$wifi <- wifi
data_sub$wifi <- factor(data_sub$wifi, levels = c("yes", "no"))

# Create variable for pool and add to data set
pool <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("pool" %in% data_sub$amenities_clean[[i]] |
     "pool with pool hoist" %in% data_sub$amenities_clean[[i]]) {
    pool[i] <-  "yes"
  } else {
    pool[i] <-  "no"
  }
}
data_sub$pool <- pool
data_sub$pool <- factor(data_sub$pool, levels = c("yes", "no"))

# Create variable for hot_tub and add to data set
hot_tub <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("hot tub" %in% data_sub$amenities_clean[[i]]) {
    hot_tub[i] <-  "yes"
  } else {
    hot_tub[i] <-  "no"
  }
}
data_sub$hot_tub <- hot_tub
data_sub$hot_tub <- factor(data_sub$hot_tub, levels = c("yes", "no"))

# Create variable for hot_tub and add to data set
tv <- vector()
for(i in 1:length(data_sub$amenities_clean)) {
  if("tv" %in% data_sub$amenities_clean[[i]] |
     "cable tv" %in% data_sub$amenities_clean[[i]]) {
    tv[i] <-  "yes"
  } else {
    tv[i] <-  "no"
  }
}
data_sub$tv <- tv
data_sub$tv <- factor(data_sub$tv, levels = c("yes", "no"))
```

## Clean host verification methods

For the variable *host_verification* the same applies as to *amenities*. We have applied the same cleaning method, first we cleaned and splitted the strings, whereafther we generated a vector with all unique amenities.
```{r Generate unique host verification methods}
# Clean and split strings for host verification methods
# Returns a list with all unique values
clean_verficiations <- function(x) {
  subbed <- gsub("\\[|\\]", "", tolower(x))
  subbed_complete <- gsub("[']", "", subbed)
  splitted <- str_split(subbed_complete, ",")
  clean <- sapply(splitted, function(x) str_trim(x, side = "both"))
  return(clean)
}

# Clean host_verifications
data_sub$host_verifications_clean <- 
  sapply(data_sub$host_verifications, function(x) clean_verficiations(x))

# Generate list with all unique host verification methods
verifications_unique = c()
for(verifications in data_sub$host_verifications_clean) {
  for(element in verifications) {
    if(!(element %in% verifications_unique) & element != "") {
      verifications_unique <- append(verifications_unique, str_trim(element))
    }
  }
}

# Show result
verifications_unique
```

### Create variables based on information stored in host verifications

We created separate variables for the most common methods of verification, namely *email*, *phone*, *facebook* and *government_id*. 
```{r Create new variables for host verification method}
# Create variable for host email and add to data set
host_email <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("email" %in% data_sub$host_verifications_clean[[i]]) {
    host_email[i] <-  "yes"
  } else {
    host_email[i] <-  "no"
  }
}
data_sub$host_email <- host_email 
data_sub$host_email <- factor(data_sub$host_email, levels = c("yes", "no"))

# Create variable for phone and add to data set 
host_phone <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("phone" %in% data_sub$host_verifications_clean[[i]]) {
    host_phone[i] <-  "yes"
  } else {
    host_phone[i] <-  "no"
  }
}
data_sub$host_phone <- host_phone 
data_sub$host_phone <- factor(data_sub$host_phone, levels = c("yes", "no"))
 
# Create variable for host facebook and add to data set
host_facebook <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("facebook" %in% data_sub$host_verifications_clean[[i]]) {
    host_facebook[i] <-  "yes"
  } else {
    host_facebook[i] <-  "no"
  }
}
data_sub$host_facebook <- host_facebook 
data_sub$host_facebook <- 
  factor(data_sub$host_facebook, levels = c("yes", "no"))

# Create variable for government id 
host_government_id <- vector()
for(i in 1:length(data_sub$host_verifications_clean)) {
  if("government_id" %in% data_sub$host_verifications_clean[[i]]) {
    host_government_id[i] <-  "yes"
  } else {
    host_government_id[i] <-  "no"
  }
}
data_sub$host_government_id <- host_government_id 
data_sub$host_government_id <- 
  factor(data_sub$host_government_id, levels = c("yes", "no"))
```

## Clean date variables

We used the variable *host_since* to create a new variable *host_years_active*, which contains information on the number of years a host has been active on the platform. 
```{r Create new variable}
# Create new variable for active years host 
data_sub <- data_sub %>% 
  mutate(host_years_active = 
           as.double(as.Date("2019-12-07") - host_since) / 365)
```

## Inspect availability variables

The variables *availability_30*, *availability_60*, *availability_90* and *availability_365* carry some of the same information. In order to inspect if we should include all variables in the data set, we plotted a correlation matrix. The plot below shows that all variables indicating the availability are strongly correlated. Therefore, we only include the variable *availability_30* for further analysis. 
```{r}
# Plot correlation matrix
data_sub %>% select_if(is.numeric) %>% 
  select(availability_30, availability_60, availability_90, availability_365) %>% 
  cor() %>% corrplot()

# Remove other availability variables from data set
data_sub <- 
  data_sub %>% 
  select(-availability_60, -availability_90, -availability_365)
```

## Check missing data

We have create a table to inspect with variables have missing cases and how many. The table shows the variables in the data set that contain missing values (in descending order). The table shows that the variable *square_feet* has $19662$ missing cases, which is about $98.19\%$. If we would deleted the missing cases, the data set will barely contain any data. Moreover, other methods for handling missing values like replacing NA-values with the mean or median would not be appropriate since the variables will be based on only $1.81\%$ of the data. Therefore, *square_feet* is nog included in the final. The variable *host_response_rate* has $9349$ missing cases, which is about $46.69\%$. The variable *host_neighbourhood* has $5972$, which is about $29.82\%$. The variable *cleaning_fee* has $3604$, which is about $18.00\%$. The variables  *host_response_rate*, *host_neighbourhood* and *cleaning_fee* do not have as many missing values as *square_feet*, however, the same reasoning applies. As a result, these variables are also excluded from the analyses. 
```{r Missing data check}
# Count missing cases per variable
na_counter <- sapply(data_sub, function(x) sum(is.na(x)))
vars <- colnames(data_sub)

# Extract all variables with NA-values
na_values <- tibble(variables = vars, na_count = na_counter) 

# Check na count per variable
na_values %>%
  filter(na_count > 0) %>%
  arrange(desc(na_count))
```

Moreover, we have check how many cases contain missing values if the other values that have missing cases would be included in the 'final' data set. The table below shows that there are 206 cases which contain missing values, which is about $1.03\%$ of the entire data set. Since this is a very small proportion it is not very likely that deleting these causes would have a large impact on the predictions. Therefore, we delete the cases with missing values. 
```{r Incomplete cases}
# Compute incomplete rows
data_sub %>% 
  select(host_since, host_response_time, host_listings_count,
         host_identity_verified, beds, bedrooms, host_years_active, 
         bathrooms) %>% 
  complete.cases() %>% 
  summary(count())
```

```{r Generate final data set}
# Select variables for data set
variables_analysis <-   
  na_values %>% 
  filter(na_count <= 158) %>% 
  select(variables) %>% 
  pull(variables)

# Create final data set
data_semi_final <- data_sub %>% select(all_of(variables_analysis))
data_semi_final <- data_semi_final %>% 
  select(-c(amenities, amenities_clean, host_verifications,
            host_verifications_clean, host_since))
data_final <- data_semi_final[complete.cases(data_semi_final), ]
```

# Create train-test split

For further analysis the data is split into a train-test set.
```{r}
# Create a train-split sets
seed_x <-  123
set.seed(seed_x)
data_split <- initial_split(data_final, prop = 0.7)
data_train <- training(data_split)
data_test <- testing(data_split)
```

# Inspecting the predicted variable

In order to prevent data leakages we only inspect the predicted variable *price* in the training set. In order to inspect *price* we have created a distribution plot. The plot shows that the data set contains some outliers and that the distribution is rightly skewed. Both the outliers and the skeweness make the data less interpretable and this could have an influence on performance of the models. 
```{r}
# Plot distribution price
ggplot(data = data_train , aes(price)) +
  geom_histogram(col="black",
                 breaks=seq(0, max(data_train$price), by=75),
                 aes(fill=..count..)) +
  labs(title="Histogram for Price", x="Price", y="Count") +
  scale_fill_gradient("Count", low="green", high="red")
```

In order to prevent this potential problems we have created a new distribution plot with a log transformed variable *price*. The plot shows that the distribution is less skewed and does not contain any large outliers. Resulting, the data that is more interpretable. Therefore, log transforming we will use the log transformed *price* for our models. 
```{r}
# Plot distribution price
ggplot(data = data_train , aes(log(price + 1))) +
  geom_histogram(col="black",
                 aes(fill=..count..)) +
  labs(title="Histogram for Ln Price", x="Ln Price", y="Count") +
  scale_fill_gradient("Count", low="green", high="red")
```

```{r}
# Log transform the price in for both training and test set
data_final$price <- log(data_final$price + 1)

# Resplit the data using the same seed 
set.seed(seed_x)
data_split <- initial_split(data_final, prop = 0.7)
data_train <- training(data_split)
data_test <- testing(data_split)

```

# K-fold cross validation 

Moreover, we have generated 10-fold cross validation sets.
```{r}
# Generate 10-fold CV sets
set.seed(321)
data_folds <- vfold_cv(data_train, v = 10)
data_folds
```

# Remove data frames to avoid leakages and errors
```{r}
rm(data)
rm(data_sub)
rm(data_semi_final)
rm(na_values)
```


